3 Description of the adopted methodology

The adopted methodolgy consists of data preparation, feature engineering and modelling. In the following section these parts are described in more details.

3.1 Preprocessing

In the preprocessing part we are going to deal with the outliers and NA values of each building and the weather data.

3.1.1 Outliers, NAs

3.1.1.1 Building 2

We’ve seen in the previous chapter, that building 2 has some extreme outliers.

Let’s plot both the energy produced and the sun radiation together:

building_2 %>%
  plot_ly(
    x=~timestamp,
    y=~energy_produced,
    type="scatter",
    mode="lines",
    name="enery produced"
  ) %>%
  add_trace(
    inherit = F,
    data=building_2_sun,
    x=~timestamp,
    y=~sun,
    type="scatter",
    mode="lines",
    opacity=0.5,
    name="sun"
  )

We can see that there is nothing extreme anomaly in the sunshine, therefore those outliers cannot be explained and most likely are wrong values.

As on the other building the range of produced energy is between 0 and 100, we have decided to cut off the outliers here based on this criteria.

building_2 <- building_2[energy_produced < 100 & energy_produced >= 0]

3.1.1.2 Weather

In order to be able to merge the data later on, we have to change the type of date_time column to POSIXct:

weather[,date_time := as.POSIXct(date_time, format = "%d/%m/%Y")] # 31/08/2016

We notice that there is a column called “location.” This is due to the Weather API support for downloading data from multiple locations in the same dataset. As the location is always the same (Vienna), we decide to drop it.

weather[,location := NULL]

3.1.2 Merging the data

We have to merge each of our building datasets with the weather data we acquired. We do this in order to be able to use features from the weather data for our forecasting. As we have 2 datasets per each building, one with sun data and the other with energy data, we first have to merge them and then we can add the weather data. We merge the datasets based on the timestamp. This is why we first need to aggregate building data daily.

The process for every building is as follows:

3.1.2.1 Building 2

b2_joined <- left_join(building_2, building_2_sun, 
                       by = c("timestamp" = "timestamp"))
summary(b2_joined)
##    timestamp                   energy_produced      sun           
##  Min.   :2016-08-09 00:15:00   Min.   :0.000   Min.   :  -0.0867  
##  1st Qu.:2017-04-29 01:45:00   1st Qu.:0.000   1st Qu.:   0.0000  
##  Median :2018-01-17 04:00:00   Median :0.000   Median :   0.1733  
##  Mean   :2018-01-18 21:45:30   Mean   :0.457   Mean   : 131.1446  
##  3rd Qu.:2018-10-10 22:30:00   3rd Qu.:0.534   3rd Qu.: 177.2333  
##  Max.   :2019-07-01 00:00:00   Max.   :3.760   Max.   :1089.6600
# aggregate daily
building2 <- b2_joined[, .(daily_total_sun=sum(sun), daily_total_energy=sum(energy_produced)),
                       by=.(Day=floor_date(timestamp, "days"))]
summary(building2)
##       Day                      daily_total_sun daily_total_energy
##  Min.   :2016-08-09 00:00:00   Min.   :    0   Min.   :  0.00    
##  1st Qu.:2017-04-29 06:00:00   1st Qu.: 4460   1st Qu.: 11.16    
##  Median :2018-01-17 12:00:00   Median :10695   Median : 40.00    
##  Mean   :2018-01-18 22:01:08   Mean   :12569   Mean   : 43.80    
##  3rd Qu.:2018-10-10 18:00:00   3rd Qu.:20393   3rd Qu.: 76.49    
##  Max.   :2019-07-01 00:00:00   Max.   :32033   Max.   :110.75
building2_weather <- left_join(building2, weather, 
                               by = c("Day" = "date_time"))

3.1.2.2 Building 5

b5_joined <- left_join(building_5, building_5_sun, 
                       by = c("timestamp" = "timestamp"))
b5_joined <- na.omit(b5_joined) # remove the last day --> bc it has NAs
summary(b5_joined)
##    timestamp                   energy_produced       sun          
##  Min.   :2016-08-09 00:05:00   Min.   : 0.000   Min.   :  -1.056  
##  1st Qu.:2017-04-18 01:56:15   1st Qu.: 0.000   1st Qu.:   1.869  
##  Median :2017-12-26 03:47:30   Median : 0.000   Median :  12.919  
##  Mean   :2017-12-26 03:47:30   Mean   : 1.591   Mean   : 146.211  
##  3rd Qu.:2018-09-04 05:38:45   3rd Qu.: 2.032   3rd Qu.: 180.131  
##  Max.   :2019-05-14 07:30:00   Max.   :18.560   Max.   :1286.025
# aggregate daily
building5 <- b5_joined[, .(daily_total_sun=sum(sun), daily_total_energy=sum(energy_produced)),
                       by=.(Day=floor_date(timestamp, "days"))]
summary(building5)
##       Day             daily_total_sun     daily_total_energy
##  Min.   :2016-08-09   Min.   :   -37.57   Min.   :   0.016  
##  1st Qu.:2017-04-18   1st Qu.: 15909.80   1st Qu.: 141.027  
##  Median :2017-12-26   Median : 34925.39   Median : 388.544  
##  Mean   :2017-12-26   Mean   : 42080.02   Mean   : 457.755  
##  3rd Qu.:2018-09-04   3rd Qu.: 67842.69   3rd Qu.: 758.336  
##  Max.   :2019-05-14   Max.   :110381.37   Max.   :1260.672
building5_weather <- left_join(building5, weather, 
                               by = c("Day" = "date_time"))

3.1.2.3 Building 8

b8_joined <- left_join(building_8, building_8_sun, 
                       by = c("timestamp" = "timestamp"))
summary(b8_joined)
##    timestamp                   energy_produced       sun          
##  Min.   :2016-08-09 00:05:00   Min.   :0.0000   Min.   :   3.354  
##  1st Qu.:2017-04-30 00:03:45   1st Qu.:0.0000   1st Qu.:   3.519  
##  Median :2018-01-19 00:02:30   Median :0.0000   Median :   3.675  
##  Mean   :2018-01-19 00:02:30   Mean   :0.3009   Mean   : 132.371  
##  3rd Qu.:2018-10-10 00:01:15   3rd Qu.:0.3760   3rd Qu.: 177.771  
##  Max.   :2019-07-01 00:00:00   Max.   :2.3920   Max.   :1068.678
# aggregate daily
building8 <- b8_joined[, .(daily_total_sun=sum(sun), daily_total_energy=sum(energy_produced)),
                       by=.(Day=floor_date(timestamp, "days"))]
summary(building8)
##       Day             daily_total_sun     daily_total_energy
##  Min.   :2016-08-09   Min.   :     3.66   Min.   :  0.00    
##  1st Qu.:2017-04-30   1st Qu.: 14104.25   1st Qu.: 29.45    
##  Median :2018-01-19   Median : 32361.41   Median : 79.45    
##  Mean   :2018-01-19   Mean   : 38086.76   Mean   : 86.58    
##  3rd Qu.:2018-10-10   3rd Qu.: 61039.17   3rd Qu.:142.61    
##  Max.   :2019-07-01   Max.   :151064.61   Max.   :214.67
building8_weather <- left_join(building8, weather, 
                               by = c("Day" = "date_time"))

This is how the dataset looks now. It has the amount of sun and energy produced and all other variables from the weather data.

3.2 Feature engineering

There are some variables which wouldn’t be important for our forecast. As we want to predict energy which highly depends on amount of sun during the day, features like wind temperature, humidity, pressure, visibility, moon illumination and moonrise are not really relevant in this case. Based on our understanding of the domain, important predictors for the energy would be sun, uv index, sun hour and cloud cover because these are the variables which may influence the amount of sun and at the same time the amount of generated energy. Therefore, we decide to keep only these features and discard the others.

Building 2:

# final building 2 aggregated weekly
building2 <- building2_weather[, .(sun=sum(daily_total_sun),
                                   energy=sum(daily_total_energy),
                                   sunHour=sum(sunHour),
                                   uvIndex=sum(uvIndex),
                                   cloudcover=sum(cloudcover)),
                       by=.(Week=floor_date(Day, "weeks"))]
building2
##            Week       sun   energy sunHour uvIndex cloudcover
##   1: 2016-08-07  63058.36 235.3050    62.0      22        253
##   2: 2016-08-14 147175.98 565.9120    99.9      38        188
##   3: 2016-08-21 137723.81 538.1480    84.9      38        145
##   4: 2016-08-28 135092.62 538.9766    78.8      41        108
##   5: 2016-09-04 114039.42 472.4040    74.8      37        203
##  ---                                                         
## 148: 2019-06-02 186802.58 638.3960    98.8      39        113
## 149: 2019-06-09 196700.09 662.3279    99.7      46         76
## 150: 2019-06-16 176269.59 606.3799   101.5      40        148
## 151: 2019-06-23 190691.92 634.7160    99.7      42        107
## 152: 2019-06-30  31125.16 103.2520    29.0      13         15

Building 5:

# final building 5 aggregated weekly
building5 <- building5_weather[, .(sun=sum(daily_total_sun),
                                   energy=sum(daily_total_energy),
                                   sunHour=sum(sunHour),
                                   uvIndex=sum(uvIndex),
                                   cloudcover=sum(cloudcover)),
                       by=.(Week=floor_date(Day, "weeks"))]
building5
##            Week       sun   energy sunHour uvIndex cloudcover
##   1: 2016-08-07 202677.93 2865.728    62.0      22        253
##   2: 2016-08-14 479672.57 6170.416    99.9      38        188
##   3: 2016-08-21 447513.56 5510.431    84.9      38        145
##   4: 2016-08-28 429050.96 5751.301    78.8      41        108
##   5: 2016-09-04 342232.86 4642.368    74.8      37        203
##  ---                                                         
## 141: 2019-04-14 532379.32 6168.128    82.9      28        182
## 142: 2019-04-21 505690.77 5872.640    92.2      32        239
## 143: 2019-04-28 372687.61 4444.544    86.6      23        359
## 144: 2019-05-05 401756.42 4984.448    88.0      23        365
## 145: 2019-05-12  55829.38  702.592    33.4       8        217

Building 8:

# final building 8 aggregated weekly
building8 <- building8_weather[, .(sun=sum(daily_total_sun),
                                   energy=sum(daily_total_energy),
                                   sunHour=sum(sunHour),
                                   uvIndex=sum(uvIndex),
                                   cloudcover=sum(cloudcover)),
                       by=.(Week=floor_date(Day, "weeks"))]
building8
##            Week       sun   energy sunHour uvIndex cloudcover
##   1: 2016-08-07 195254.33  473.506    62.0      22        253
##   2: 2016-08-14 442996.59 1052.572    99.9      38        188
##   3: 2016-08-21 415648.10  986.530    84.9      38        145
##   4: 2016-08-28 404495.16  954.740    78.8      41        108
##   5: 2016-09-04 343209.16  831.634    74.8      37        203
##  ---                                                         
## 148: 2019-06-02 561870.35 1203.696    98.8      39        113
## 149: 2019-06-09 591717.49 1262.832    99.7      46         76
## 150: 2019-06-16 364706.14 1150.704   101.5      40        148
## 151: 2019-06-23 569457.65 1186.232    99.7      42        107
## 152: 2019-06-30  93400.23  192.656    29.0      13         15

Creating ts objects so that we can use them for modelling:

building_2_ts <- ts(building2, frequency = 52, start = c(2016,8))
building_5_ts <- ts(building5, frequency = 52, start = c(2016,8))
building_8_ts <- ts(building8, frequency = 52, start = c(2016,8))
ggAcf(building_8_ts)

We can see that there is a strong seasonality, as lag 0, 52 and 104, so every 52th lag is very high above the line. That means that we have a strong correlation of the previous years. If we think of how the sesons change in each year, this indeed makes a lot of sense.

3.2.1 Time series and decomposing

# Weekly data aggregated
building_2_sun_weekly <- building_2_sun[, .(weekly_sum=sum(sun)),
                                       by=.(Week=floor_date(timestamp, "weeks"))]
building_2_weekly <- building_2[, .(weekly_sum=sum(energy_produced)),
                               by=.(Week=floor_date(timestamp, "weeks"))]

building_5_sun_weekly <- building_5_sun[, .(weekly_sum=sum(sun)),
                                       by=.(Week=floor_date(timestamp, "weeks"))]
building_5_weekly <- building_5[, .(weekly_sum=sum(energy_produced)),
                               by=.(Week=floor_date(timestamp, "weeks"))]

building_8_sun_weekly <- building_8_sun[, .(weekly_sum=sum(sun)),
                                       by=.(Week=floor_date(timestamp, "weeks"))]
building_8_weekly <- building_8[, .(weekly_sum=sum(energy_produced)),
                               by=.(Week=floor_date(timestamp, "weeks"))]

Here we create time series objects out of monthly and weekly aggregated data and decompose them in order to see their components (trend, data, seasonal component and remainder)

building_2_weekly_ts = ts(building_2_weekly[, "weekly_sum"], frequency = 365.25/7, start = c(2016,8))
autoplot(decompose(building_2_weekly_ts, type="additive"))

building_5_weekly_ts = ts(building_5_weekly[, "weekly_sum"], frequency = 365.25/7, start = c(2016,8))
autoplot(decompose(building_5_weekly_ts, type="additive"))

building_8_weekly_ts = ts(building_8_weekly[, "weekly_sum"], frequency = 365.25/7, start = c(2016,8))
autoplot(decompose(building_8_weekly_ts, type="additive"))

The seasonal component shows us how with the time data gets to the same point every month/week, and from the seasonal component the future models will be built upon.

The remainders (noise) of these time series seem to be normally distributed (white noice), which indicates that future models will potentially learn from the data and the data is not bad.

The trends of the time series are not that smooth in general, which means the predictive power of the models in the future might be quite low, however by combining both seasonal and trend components the predictive power will increase in anyway, since these both are the most important for the future models.

3.2.2 Checking for stationarity

Stationarity basically means that the statistical properties of the time series data do not change over time, what is important for many analytical tools and future modeling.

adf.test(building_2_weekly_ts)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  building_2_weekly_ts
## Dickey-Fuller = -3.4459, Lag order = 5, p-value = 0.04975
## alternative hypothesis: stationary
adf.test(building_5_weekly_ts)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  building_5_weekly_ts
## Dickey-Fuller = -2.9694, Lag order = 5, p-value = 0.1728
## alternative hypothesis: stationary
adf.test(building_8_weekly_ts)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  building_8_weekly_ts
## Dickey-Fuller = -2.6659, Lag order = 5, p-value = 0.2989
## alternative hypothesis: stationary

The H0 here is that the time series is non-stationary, and the alternative hypothesis is that it is stationary. Since the p-value is only smaller than 0.05 for building 2, we can reject H0 and conclude that this time series is stationary.

However, building 5 and 8 are above 0.05, but the nature of the three buildings are similar to each other. As we are going to use data from other sources we will try out modeling with non-stationary data and see how it performs.

3.3 Modeling

For the forecasting we are going to try out different models first on one building, which is building 2. The models we are going to compare are:

  • basic arima
  • sarima
  • holt winters
  • arima with external regressors

After we have seen how they perform on one building, we will plot the residuals to see which one performs the best on a longer time span. Then the best model is going to be used to forecast the enrgy production of the other building as well.

3.3.0.1 Split into train/test

# building 2
training_b2 <- window(building_2_ts, start=c(2016,8),end= c(2018,11))
testing_b2 <- window(building_2_ts, start=c(2018,12) )

# building 5
training_b5 <- window(building_5_ts, start=c(2016,8),end= c(2018,11))
testing_b5 <- window(building_5_ts, start=c(2018,12) )

# building 8
training_b8 <- window(building_8_ts, start=c(2016,8),end= c(2018,11))
testing_b8 <- window(building_8_ts, start=c(2018,12) )

3.3.1 Arima

We first try out the most basic arima model on building 2:

arima_basic <- auto.arima(training_b2[,"energy"])
plot(forecast(arima_basic, h=52))
lines(testing_b2[,"energy"], col="red")

It is not too bad, but also not so good. There is a point at which it is completely unable to capture amount of sun.

3.3.2 Sarima

We try out a Sarima model:

sarima_forecast <- sarima.for(training_b2[,"energy"], n.ahead = length(testing_b2[,"energy"]),
 p=0,d=1,q=1, P=1, D=1, Q=0, S=12)

3.3.3 Holt winters

# Building 2
autoplot(building_2_ts[,"energy"])

building_2_ts.fc <- HoltWinters(training_b2[,"energy"])
#building_2_weekly_ts.fc$fitted
plot(building_2_ts.fc)

plot(forecast(building_2_ts.fc, h=52))
lines(testing_b2[,"energy"], col="red")

3.3.4 Arima including external regressors

We decide to include also other variables to our model. As discussed in Feature Engineering section, we chose sun, sunHour, uvIndex and cloudcover, as we believe they may be relevant for our prediction.

arima_b2 <- auto.arima(training_b2[,"energy"], xreg=training_b2[,c("sun","sunHour", "uvIndex", "cloudcover")])

plot(forecast(arima_b2, xreg=testing_b2[,c("sun", "sunHour", "uvIndex", "cloudcover")]))
lines(testing_b2[,"energy"], col="red")